Skip to content

feat: Maarten/aarch64 spark support#268

Open
mvansegbroeck wants to merge 7 commits intomainfrom
maarten/aarch64-spark-support
Open

feat: Maarten/aarch64 spark support#268
mvansegbroeck wants to merge 7 commits intomainfrom
maarten/aarch64-spark-support

Conversation

@mvansegbroeck
Copy link
Copy Markdown
Collaborator

@mvansegbroeck mvansegbroeck commented Mar 19, 2026

Summary

Add DGX Spark (aarch64) support for Safe Synthesizer:

  • Dockerfile: New containers/Dockerfile.cuda-aarch64 for building on aarch64/Spark
  • Docs: docs/DGX_SPARK.md — quickstart guide for running NSS on DGX Spark
  • Dependencies: Platform-specific dependency guards for aarch64 (torchvision, torchao, xformers); add missing onnxruntime and opt_einsum to cu128 extra
  • Training config: Auto-detect aarch64 and override incompatible defaults (flash attention, quantization)
  • Secret scanning: Add pragma: allowlist secret for API key placeholder in docs

Pre-Review Checklist

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

@mvansegbroeck mvansegbroeck requested review from a team as code owners March 19, 2026 17:37
Copy link
Copy Markdown
Collaborator

@binaryaaron binaryaaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claude was a liar about unsloth and aarch64. also, pull in this change and test again

@mvansegbroeck mvansegbroeck force-pushed the maarten/aarch64-spark-support branch 6 times, most recently from 59d13ec to 83fff62 Compare March 24, 2026 01:22
Copy link
Copy Markdown
Collaborator

@binaryaaron binaryaaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to support dgx spark quickly, the different install path in the container is fine but it's not proper support. let's make a follow up for this - we'll need to do a proper spike on cross-platform support + cuda version support.

"vllm==0.15.0",
"xformers==v0.0.33.post2; sys_platform == 'linux'",
"xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'",
]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably going to be a whole thing but we should make a new dep group or do better platform resolution for cross-platform deps.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this needs a proper spike. For now the platform markers are minimal and don't break x86_64 right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm do we want to keep the sys_platform markers? locking is borked in CI right now and I've been fighting it by putting linux markers in

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really will need a spark and/or station as part of our github runners so we make sure this continues to work.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what to change here.

binaryaaron
binaryaaron previously approved these changes Mar 24, 2026
Copy link
Copy Markdown
Collaborator

@binaryaaron binaryaaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving conditionally so i don't block while i'm out

@binaryaaron
Copy link
Copy Markdown
Collaborator

We've also got to publish the container somewhere in the future, same with the cuda one.

@mvansegbroeck mvansegbroeck force-pushed the maarten/aarch64-spark-support branch 5 times, most recently from a113a39 to 655ca6c Compare March 24, 2026 21:22
@mvansegbroeck mvansegbroeck changed the title Maarten/aarch64 spark support feat: Maarten/aarch64 spark support Mar 24, 2026
@mvansegbroeck mvansegbroeck force-pushed the maarten/aarch64-spark-support branch 2 times, most recently from 8408cca to 819bdb8 Compare March 26, 2026 00:05
mvansegbroeck and others added 4 commits March 31, 2026 10:03
- containers/Dockerfile.spark: container-based install using nvcr.io/nvidia/vllm:26.02-py3
- docs/DGX_SPARK.md: quick start guide (build + run in 2 steps)
- pyproject.toml: platform markers for aarch64-incompatible packages
  (faiss-gpu-cu12, torchvision+cu128, torchao, xformers)
- config/training.py: auto-fallback Flash Attention 3 to sdpa on aarch64
- vllm_backend.py: handle vllm versions without attention_config kwarg

Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
onnxruntime is required by gliner but not always resolved
transitively on aarch64. opt_einsum is directly imported in
dp_transformers/linear.py but was never declared.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
@mvansegbroeck mvansegbroeck force-pushed the maarten/aarch64-spark-support branch from f3b9988 to c7f1cf7 Compare March 31, 2026 17:04
…lockfile

Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>

# 2. Torch-dependent packages — --no-deps preserves the container's torch/CUDA
RUN pip install --no-deps \
peft accelerate bitsandbytes datasets==4.3.0 trl==0.26.1 \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 should we add an extras group for them? mostly to not have these versions be different than what's in the lockfile

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--no-deps install is intentional to preserve the container's torch/CUDA

"vllm==0.15.0",
"xformers==v0.0.33.post2; sys_platform == 'linux'",
"xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'",
]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm do we want to keep the sys_platform markers? locking is borked in CI right now and I've been fighting it by putting linux markers in

"vllm==0.15.0",
"xformers==v0.0.33.post2; sys_platform == 'linux'",
"xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'",
]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really will need a spark and/or station as part of our github runners so we make sure this continues to work.

@mvansegbroeck mvansegbroeck force-pushed the maarten/aarch64-spark-support branch from 94bf848 to 2688050 Compare April 7, 2026 16:55
- Move DGX_SPARK.md to docs/developer-guide/ and add to mkdocs nav
- Make DGX Spark doc description Spark-specific
- Remove SSH clone hint (repo is public)
- Update docs link to github pages URL
- Combine nested if into single condition in training.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants